Identification of metabolic pathways contributing to ER+ breast cancer disparities using a machine-learning pipeline

African American (AA) women in the United States have a 40% higher breast cancer mortality rate than Non-Hispanic White (NHW) women. The survival disparity is particularly striking among (estrogen receptor positive) ER+ breast cancer cases. The purpose of this study is to examine whether there are racial differences in metabolic pathways typically activated in patients with ER+ breast cancer. We collected pretreatment plasma from AA and NHW ER+ breast cancer cases (AA n = 48, NHW n = 54) and cancer-free controls (AA n = 100, NHW n = 48) to conduct an untargeted metabolomics analysis using gas chromatography mass spectrometry (GC–MS) to identify metabolites that may be altered in the different racial groups. Unpaired t-test combined with multiple feature selection and prediction models were employed to identify race-specific altered metabolic signatures. This was followed by the identification of altered metabolic pathways with a focus in AA patients with breast cancer. The clinical relevance of the identified pathways was further examined in PanCancer Atlas breast cancer data set from The Cancer Genome Atlas Program (TCGA). We identified differential metabolic signatures between NHW and AA patients. In AA patients, we observed decreased circulating levels of amino acids compared to healthy controls, while fatty acids were significantly higher in NHW patients. By mapping these metabolites to potential epigenetic regulatory mechanisms, this study identified significant associations with regulators of metabolism such as methionine adenosyltransferase 1A (MAT1A), DNA Methyltransferases and Histone methyltransferases for AA individuals, and Fatty acid Synthase (FASN) and Monoacylglycerol lipase (MGL) for NHW individuals. Specific gene Negative Elongation Factor Complex E (NELFE) with histone methyltransferase activity, was associated with poor survival exclusively for AA individuals. We employed a comprehensive and novel approach that integrates multiple machine learning and statistical methods, coupled with human functional pathway analyses. The metabolic profile of plasma samples identified may help elucidate underlying molecular drivers of disproportionately aggressive ER+ tumor biology in AA women. It may ultimately lead to the identification of novel therapeutic targets. To our knowledge, this is a novel finding that describes a link between metabolic alterations and epigenetic regulation in AA breast cancer and underscores the need for detailed investigations into the biological underpinnings of breast cancer health disparities.


Methods
Study population. African American (AA) and Non-Hispanic White women (NHW) aged 20-79 years, with a new diagnosis of (American Joint Committee of Cancer) AJCC stage I-III, ER + breast cancer were recruited from 3 hospitals in Chicago, IL between 2018 and 2019. This study was approved by the University of Illinois, Chicago Institutional Review Board (IRB protocol number 2017-1029). An informed consent was obtained from all subjects. All methods were performed in accordance with the relevant guidelines and regulations. Healthy AA and NHW women without breast symptoms or a personal history of breast cancer who presented for a screening mammogram at corresponding mammography centers were recruited as control subjects.
Only control subjects whose screening mammogram was negative (BI-RADS category [1][2] were included in the analytic study population. Non-fasting whole blood was collected from AA and NHW breast cancer cases prior to breast surgery or any other breast cancer treatment. Control subjects donated a blood specimen at the time of study enrollment. Demographic and tumor pathologic characteristics are described in Table 1. Whole blood collected in serum separator tubes was allowed to clot for 40 min at room temperature, and plasma was separated by centrifugation within 1 h of collection (1680g for 10 min at room temperature in a horizontal rotor, tabletop centrifuge). Aliquots of plasma were immediately frozen at − 20 °C, and transferred to − 80 °C within 4 weeks. All samples were analyzed after a single freeze-thaw cycle.

Metabolomics analysis. Sample preparation.
To precipitate protein, 200 µL of thawed plasma was mixed with 1 mL isopropanol, acetonitrile, H2O 3:3:2 mix. The mixture was vortexed for 10 s and kept at − 20 °C for further processing. After spinning down, 600 µL of the supernatant were transferred to a separate Eppendorf tube and submitted to UIUC metabolomics core for a GC/MS whole metabolite profiling.
Mass spectrometry. A gas chromatography/mass spectrometry (GC/MS) based metabolic profiling was performed to detect and quantify the metabolites in plasma. Supernatants were collected by centrifugation (5 min at 15,000g), dried and derivatized with 50 μL of 40 mg/mL methoxyamine hydrochloride in pyridine (Sigma-Aldrich, MO, USA) for 60 min at 50 °C, then with 50 μL MSTFA + 1%TMCS (Thermo, MA, USA) at 70 °C for 120 min, and following 2-h incubation at room temperature. 50 μL of the internal standard (hentriacontanoic acid, 1 mg/mL) was added to each sample before derivatization. Metabolite profiles were acquired using a GC-MS system (Agilent Inc, CA, USA) consisting of an Agilent 7890 gas chromatograph, an Agilent 5975 MSD and an HP 7683B autosampler. Gas chromatography was performed on a ZB-5MS (60 m × 0.32 mm ID and 0.25 μm film thickness) capillary column (Phenomenex, CA, USA). The inlet and MS interface temperatures were 250 °C, and the ion source temperature was adjusted to 230 °C. An aliquot of 1 μL was injected with the split ratio of 10:1. The helium carrier gas was kept at a 2 mL/min constant flow rate. The temperature program was: 5-min isothermal heating at 70 °C, followed by an oven temperature increase of 5 °C/min to 310 °C and a final 10 min at 310 °C. The mass spectrometer was operated in positive electron impact mode (EI) at 69.9 eV ionization energy at m/z 30-800 scan range.
Spectra and data analysis. The spectra of all chromatogram peaks were evaluated using the AMDIS 2.71 (NIST, MD, USA) software using a custom-built database with 460 unique metabolites. All known artificial www.nature.com/scientificreports/ peaks were identified and removed before data mining. All data were normalized to the internal standard hentriacontanoic acid, 1 mg/mL in each chromatogram, and the cell's dry weight (DW) to compare samples. The instrument variability was within the standard acceptance limit (5%).
Data processing. As part of data preprocessing, metabolites missing over 40% of data were removed.
Remaining missing data values were completed using a trained data imputation algorithm using the available data. The imputation performance was tested one metabolite after another by removing one column and doing imputation on them. The imputation performance was evaluated for each specific metabolite considering R square metric. This process was repeated for each metabolite in the data set, removing and imputing them individually. Metabolites with R 2 < 0.3, that indicated poor imputation performance were removed. To explore which metabolites (these are referred as features) are indicators of cancer and in each group, the data was sorted as African American (AA) data set and Non-Hispanic White (NHW) data set.

Statistical analyses.
In order to compare the metabolite levels between cases and controls in each race group, we performed unpaired t-tests between cases and controls for each metabolic feature, separated by race groups. Differential metabolites were identified by adjusting the p values for multiple testing at a False discovery Rate (FDR) threshold of 0.05. A (*) indicates p value (p < 0.05) considered statistically significant. p value is reported as two-tailed p value, and standard error is included. We explored the performance of each feature selection and classification performance method by comparing the AUC values obtained using each method combination. A complete diagram of the pipeline used is described in Fig. 1. Predictive approaches used included: Decision Tree, Random Forest, Logistic Regression, and Support Vector Machine (SVM). Boruta, caret, tidyverse, readxl, e1071, pROC, tree, randomForest R packages were used. Parameters used on each test are reported in Table 2. www.nature.com/scientificreports/ Performance evaluation. Performance metric statistics were gathered, including Mathews Correlation Coefficient (MCC) and F1 score, as well as receiver operator characteristic and area under the curve (ROC, AUC) curves for each combination of feature selection and prediction method to visualize classification performance (true positive rate vs. false positive rate). We used the receiver operating characteristic curve (ROC) for each feature selection and prediction method as a quantitative performance metric. We further analyzed the performance of the combination methods by reporting F score statistics and MCC (Table 3). Once selected the best feature selection/prediction model combination, the top features of importance were identified. Their relative influence on each model (in %), in terms of Mean Decrease Gini were computed. These features were further explored for pathway analysis.

Pathway analysis.
To identify the most relevant pathways involved in each set (AA and NHW) a pathway analysis was performed using Metaboanalyst 4.0 Pathway Analysis Module. This module combines pathway enrichment analysis with pathway topology analysis. It uses high-quality KEGG metabolic pathways as the backend knowledge base and integrates many well-established (i.e., univariate analysis, over-representation analysis) methods, as well as novel algorithms and concepts (i.e., Global Test, GlobalAncova, network topology analysis) into pathway analysis. This analysis was limited to all metabolic features selected with the combination of Boruta test feature selection and Random Forest classification and that were consistent in the rest of the feature selection methods (Table 4A,B).

TCGA cohort mRNA and survival analyses.
To gain molecular insights on the pathways altered in AA vs NHW individuals, the different abundances in AA vs NHW patients with breast cancer were mapped manually. Unpaired t-test was performed with Welch's correction. To test the association of these metabolic genes with survival in AA breast cancer, we performed an analysis of the PanCancer atlas TCGA Cohort. The cohort was selected based on a cohort that contained information regarding self-reported race, and subtype. Cohort included AA cases (n = 82) and NHW cases (510) of ER+ breast cancer, that contained mRNA, protein expression and clinical outcomes.

Results
Race-specific plasma metabolic signatures. Study population included 102 patients with breast cancer (AA, n = 48; NHW, n = 54) and 148 healthy women as controls (AA, n = 100; NHW, n = 48). Demographic and clinical characteristics are shown in Table 1. Following data processing with minimum observed values for each metabolite, 83 metabolites were included in the analysis. For statistical purposes, we report statistical differences in p value < 0.05 obtained by t-tests. (Table 5).
Aminoacids are good predictors of ER+ breast cancer in African American patients, while fatty acids are good predictors in Non-Hispanic White patients. We compared feature selection and prediction combinations instead of relying on only one machine learning method as previously described 39 . Boruta/ Random Forest combination was the most accurate, or best combination of feature selection/prediction model for AA (AUC = 0.79, MCC = 0.56) and NHW (AUC = 0.78, MCC = 0.52), according to the Matthews correlation coefficient (MCC) (Fig. 2). This optimal method identified 16 features that discriminate between cases and controls in both races. In AA individuals, a total of 10 features contributed to the signature to distinguish between case and controls and 9 of these features were exclusive for AA patients. These metabolites were: alpha ketoisocaproic acid, arginine, alpha tocopherol, citric acid, histidine, maltose, methionine, n-acetylglutamic acid, o-phosphoethanolamine, and oxalic acid. In AA individuals with breast cancer, we observed higher plasma levels of alpha ketoisocaproic acid a ketoacid, product of branched-chain aminoacid metabolism, alpha tocopherol oxalic acid, and citric acid, when compared to the healthy control group. In contrast, we observed lower arginine, histidine, maltose, methionine, n-acetyl glutamic acid, and o-phosphoethanolamine levels (Fig. 3A). In NHW individuals, a total of six features were identified as part of a signature that discriminates between case and controls: β-hydroxybutyrate, cholesterol, oxalic acid, palmitic acid, palmitoleic acid, and tetra decanoic acid. In NHW, we observed higher levels of all six metabolites in cases when compared with race-matched controls, and five of these features were exclusive for NHW patients (Fig. 3B,C). Overall, each of these differentially abundant features contributed to the metabolic signature significantly (Fig. 4A,B). www.nature.com/scientificreports/ In order to evaluate the association of differential metabolite levels and tumor grade, a total of 148 breast cancer samples (with known tumor grade 1, 2, 3), were compared with 102 healthy controls. We grouped grades I and II breast cancer patients as "low grade" and grade III as "high grade" similar to previous studies [40][41][42] . We evaluated differential levels of all metabolites (n = 15) that were good predictors of breast cancer in Fig. 4A,B. Eight metabolites were found to be significantly changed among healthy controls and two grades of breast cancer (Fig. 4C).

Key metabolic pathways activated in ER + breast cancer patients differ by race. Pathway-based
analysis was performed to explore metabolic pathway differences according to race in breast cancer cases. For AA individuals, metabolites that are associated with aminoacyl-tRNA biosynthesis, arginine metabolism (p < 0.05), branched amino acid metabolism (p = 0.05), and histidine metabolism (p = 0.09) were differentially abundant in plasma from individuals with breast cancer (Fig. 5A). Other pathways including, TCA cycle, beta-alanine metabolism, sphingolipid metabolism, alanine, aspartate, and glutamate metabolism glyoxylate and dicarboxylate metabolism, cysteine and methionine metabolism, glycerophospholipid metabolism, arginine and proline metabolism, were not statistically significant, but were identified in the global pathway analysis.
In NHW individuals, three significant pathways were identified including fatty acid metabolism (p < 0.05), ketone body metabolism (p < 0.05), and butanoate metabolism (p = 0.05). Pathways like unsaturated fatty acid synthesis, fatty acid elongation and degradation, steroid metabolism, primary bile acid synthesis and steroid hormone synthesis were not statistically significant but were identified in the global pathway analysis (Fig. 5B). One potential explanation for this is that there is a higher production of free fatty acids coming from de novo lipogenesis or from triacylglycerol lipolysis. Therefore, we analyzed mRNA levels of key enzymes responsible for the generation of free fatty acids in tumors from AA vs NHW patients with ER+ breast cancer in TCGA. Expression of fatty acid synthase and Monoglycerol lipase was higher in tumors from NHW patients compared with AA patients (Fig. 5C,D). www.nature.com/scientificreports/

Metabolic pathway analysis suggests a role for epigenetic regulation differences that result in worse survival in African-American breast cancer patients.
In AA women with breast cancer, we observed lower methionine (Met) levels, which led us to hypothesize that there is a trend of higher demand for Met consumption and metabolism. One potential explanation for the utilization of Met is that it is being used as substrate for methylation processes. Therefore, we analyzed samples from Pan Cancer Atlas cohort (TCGA) for mRNA levels of MAT1A (methionine adenosyl transferase 1A) and methyltransferases (DNA and Histone) in AA (n = 183) compared with NHW individuals (n = 757). MAT1A is responsible for catalyzing the reaction from methionine to SAM. SAM then is used as a substrate for methyltransferases. We hypothesized and found that AA individuals have higher mRNA levels of methyl transferases and MAT1A. We identified 200 genes with histone methylation activity, and 71 genes with DNA methylation activity that were differentially expressed between AA and NHW individuals. Out of the 271 genes studied, 49 had increased expression in AA individuals, 15 of these were statistically significant (Fig. 6A). To obtain additional insights, we further investigated the associations of these genes with survival in AA and NHW breast cancer. A Kaplan Meier survival analysis of NELFE showed that high expression was associated with poor survival in AA women, but it was not associated with survival among NHW patients (Fig. 6B-D) Together these findings suggest that this gene may play role in the biology of breast cancer in AA individuals.

Discussion
In this study, we established blood metabolite profiles of 102 patients with breast cancer and 148 healthy controls and we observed that metabolic profiles differed by race. We found that levels of amino acids were significantly lower in AA patients with breast cancer than healthy controls. One possible explanation might be the high demand for amino acids in breast tumor metabolism. Tumor cells require amino acids as alternative fuels, and as precursors of biosynthetic materials, including those for DNA synthesis, for building new blood vessels and supporting their rapid growth and proliferation. The alterations in these metabolic pathways cause differential activation of downstream signaling that further translate to altered oncogenic regulation. Breast cancer is associated with marked metabolic shifts. However, these shifts in metabolism within tumors can vary between stages, subtypes, and race 25,43,44 Most differences involve variation in amino acid, fatty acids and TCA intermediates. Previous studies have shown that pathways associated with energy metabolism: glycolysis, amino acid metabolism and TCA cycle, are dominant in AA women with ER + tumors, which potentially indicates the aggressiveness of their tumors 44,45 . Additionally, changes in metabolite levels affect gene and protein expression by altering substrate availability of epigenetic modifiers that mark to the DNA, RNA, and histones, ultimately dictating cancer progression and outcome [46][47][48] . Previous studies 19-21, 49-51 and studies from our lab [52][53][54][55] showed how systemic or cellular metabolic  www.nature.com/scientificreports/ changes impact epigenetic landscape, which is important for ERα activity and response to clinical drugs. Amino acids provide metabolic intermediates for epigenetic regulation. For example, carbon units from methionine cycle serve as a methyl donor for DNA and histone methyltransferases (DNMTs and HMTs). HMTs catalyze the transfer of methyl groups to lysine and arginine residues in histone proteins. Depending on the type of residue being modified, histone methylation regulates activation or repression of gene expression. In AA women, poverty levels correlate with hypermethylation of cancer-associated pathways including glucocorticoid receptor, p53, estrogen dependent breast cancer signaling, and cell proliferation 56 . Therefore, hypermethylation is a possible biological mechanism that can explain worse outcomes in AA women with breast cancer conferred by living in low socioeconomical status (SES) neighborhoods. We found that the methionine levels are lower in plasma samples from AA women with breast cancer. Methionine is a methyl group donor for methylation and represents the major contributor for epigenetic regulation. Both DNMTs and HMTs use S-adenosylmethionine (SAM) as a methyl donor, which is generated in the methionine cycle by the action of methionine adenolsyltransferase (MAT). Increased uptake of methionine leads to an excess of SAM, resulting in hypermethylation and aberrant histone methylation. Aberrant hypermethylation in DNA can occur in promoter and enhancer regions of cancer related genes, such as tumor suppressor genes, resulting in loss of expression 57 . These abnormalities are involved in oncogenesis and tumor progression. Many cancer suppressor genes are silenced by DNA methylation in tissues 58 . A study in 2019 reported that residing in residential areas with high poverty may impact DNA methylation patterns. The authors reported that in AA women, poverty levels affected hypermethylation of important pathways including glucocorticoid receptor, p53, estrogen dependent breast cancer signaling, cell proliferation (BCL2, JUN, ESR1, ESR2, CYP19A1) 56 . Therefore, hypermethylation is a candidate biological mechanism that may contribute to worse outcomes in AA women with breast cancer because of a higher likelihood of residing in low SES neighborhoods.
Besides its function as methyl donor, SAM is involved in polyamine synthesis. Polyamines are essential for cell growth, and enzymes that catalyze this synthesis are often overexpressed in cancer. Studies have shown that polyamines can alter gene expression by modulating global chromatin structure and ultimately affecting cell www.nature.com/scientificreports/ proliferation 59,60 . Aminoacyl tRNAs (aa-tRNA) biosynthesis associated metabolites were enriched in AA individuals with breast cancer. The production of aminoacyl-tRNA is directly associated with the protein synthesis process. The main role of aa-tRNAs is to deliver the amino acid to the ribosome for incorporation into the polypeptide chain that is being produced during translation. Moreover, we observed that the level of fatty acids in NHW patients with breast cancer was higher compared with healthy controls or AA individuals. These results might suggest higher rates of de novo lipogenesis or triacylglycerol lipolysis in the tumor. Highly proliferative cells use citrate produced in TCA cycle to generate fatty acids in the cytosol. The principal fatty acid synthesis enzyme, fatty acid synthase (FASN), is often upregulated in tumors 61 . Moreover, FASN has been detected in cell lysate and supernatant of breast cancer cells derived from NHW patients 62 . The regulation of de novo lipogenesis occurs mainly at the transcriptional level through the activation of sterol regulatory element binding proteins (SREBP). Its active N terminal fragment translocate to the nucleus and induces the transcription of gene that contain sterol regulatory elements (SRE) such as lipogenic enzymes (fatty acid synthase) FAS, ACLY (ATP citrate lyase) and ACC (acetyl coA carboxylase). In addition to its function in de novo lipogenesis, FASN is part of crosstalk between PI3K-AKT, MAPK-ERK2 pathways and their interaction with estradiol-ERα. The overexpression of lipogenic enzymes mentioned above, have been observed early in tumorigenesis. In ER+ cancer models, both genetic and pharmacological inhibition of FASN, hypersensitizes ERα to estrogen dependent transactivation, inducing estrogen receptor element (ERE) transcriptional activity and MAPK-ERK signaling 63 . Our results are consistent with previously published metabolomics studies that show fatty acids levels are higher in plasma from patients with breast cancer. In addition, our group previously reported that higher levels of free fatty acids in plasma were an indicator of high breast cancer risk, and result in mTOR and MAPK signaling and ERα recruitment to chromatin to increase transcriptional activity of factors that regulate cancer cell metabolism 53 . Consistently, other studies also reported a role for free fatty acids in other cancer types including lung 64 childhood tumors 65 , and colon cancer 66 . Recent studies using biological deuterium fractionation and discrimination points out diet as the main source of increased fatty acid pool in plasma, which is delivered to cells via circulation. Thus, fatty acids act as the intermediate proton carrying carbon source for mitochondrial respiration. Further, generation of ketones using these deuterium-depleted fatty acids might explain benefit of ketogenic diets. Food insecurity and inequality in food quality and availability resulting in metabolic inefficiency might contribute to differential fatty acid profiles in AA vs. NHW women and breast cancer disparities.
One weakness of our study is the data imbalance. When data imbalance occurs, the number of samples in one class is larger than number of samples in the other class, and we can encounter an overoptimistic estimation of the classifier ability on the class with majority number of samples, such as our case (for AA individuals we have 100 healthy controls vs 48 cancer cases). To overcome this, we use the MCC score since produces a high score only if the prediction obtained good results in all the four confusion matrix categories (true positives, false negatives, true negatives, and false positives, Table 3). MCC score is a more reliable statistical rate and performance metric for binary classification when dealing with imbalanced data in binary data as it has been reported previously 67,68 . www.nature.com/scientificreports/ In conclusion, our metabolic and bioinformatics analyses provided valuable information regarding metabolic alterations associated with cancer within each race, and provide insights on potential metabolic vulnerabilities that should be considered for depth studies of tumor biology that interrogate the role of candidate oncometabolites in breast cancer disparities, which may inform novel therapeutic development and biologically-informed epidemiologic studies to provide greater insight into biological mechanisms underlying racial disparities in breast cancer survival.

Data availability
The datasets generated and analyzed during the current study will be available from the corresponding author on reasonable request.